Cuckoo hashing is a scheme in computer programming for resolving of values of in a hash table, with worst-case constant time lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches in a variation of the behavior referred to as brood parasitism; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.
'''function''' lookup(x) '''is''' '''return''' '''end function''' |
1 '''function''' insert(x) '''is''' 2 '''if''' lookup(x) '''then''' 3 '''return''' 4 '''end if''' 5 '''loop''' Max-Loop '''times''' 6 '''if''' = '''then''' 7 := x 8 '''return''' 9 '''end if''' 10 x 11 '''if''' = '''then''' 12 := x 13 '''return''' 14 '''end if''' 15 x 16 '''end loop''' 17 rehash() 18 insert(x) 19 '''end function''' |
One method of proving this uses the theory of : one may form an undirected graph called the "cuckoo graph" that has a vertex for each hash table location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the cuckoo graph for this set of values is a pseudoforest, a graph with at most one cycle in each of its connected components. Any vertex-induced subgraph with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the hash table. When the hash function is chosen randomly, the cuckoo graph is a random graph in the Erdős–Rényi model. With high probability, for load factor less than 1/2 (corresponding to a random graph in which the ratio of the number of edges to the number of vertices is bounded below 1/2), the graph is a pseudoforest and the cuckoo hashing algorithm succeeds in placing all keys. The same theory also proves that the expected size of a connected component of the cuckoo graph is small, ensuring that each insertion takes constant expected time. However, also with high probability, a load factor greater than 1/2 will lead to a giant component with two or more cycles, causing the data structure to fail and need to be resized.
Since a theoretical random hash function requires too much space for practical usage, an important theoretical question is which practical hash functions suffice for Cuckoo hashing. One approach is to use k-independent hashing. In 2009 it was shownCohen, Jeffrey S., and Daniel M. Kane. "Bounds on the independence required for cuckoo hashing." ACM Transactions on Algorithms (2009). that -independence suffices, and at least 6-independence is needed. Another approach is to use tabulation hashing, which is not 6-independent, but was shown in 2012Pǎtraşcu, Mihai, and Mikkel Thorup. "The power of simple tabulation hashing." Journal of the ACM (JACM) 59.3 (2012): 1-50. to have other properties sufficient for Cuckoo hashing. A third approach from 2014Aumüller, Martin, Martin Dietzfelbinger, and Philipp Woelfel. "Explicit and efficient hash families suffice for cuckoo hashing with a stash." Algorithmica 70.3 (2014): 428-456. is to slightly modify the cuckoo hashtable with a so-called stash, which makes it possible to use nothing more than 2-independent hash functions.
The following two tables show the insertion of some example elements. Each column corresponds to the state of the two hash tables over time. The possible insertion locations for each new value are highlighted. The last column illustrates a failed insertion due to a cycle, details below.
+ Table 1: uses h(k) |
!colspan=10 Steps |
+ Table 2: uses h′(k) |
!colspan=10 Steps |
67 replaces 75 in cell 6 |
53 replaces 50 in cell 4 |
105 replaces 100 in cell 9 |
45 replaces 53 in cell 4 |
75 replaces 67 in cell 6 |
100 replaces 105 in cell 9 |
50 replaces 45 in cell 4 |
67 replaces 75 in cell 6 |
Generalizations of cuckoo hashing that use more than two alternative hash functions can be expected to utilize a larger part of the capacity of the hash table efficiently while sacrificing some lookup and insertion speed. Using just three hash functions increases the load to 91%.
Another generalization of cuckoo hashing called blocked cuckoo hashing uses more than one key per bucket and a balanced allocation scheme. Using just 2 keys per bucket permits a load factor above 80%.
Another variation of cuckoo hashing that has been studied is cuckoo hashing with a stash. The stash, in this data structure, is an array of a constant number of keys, used to store keys that cannot successfully be inserted into the main hash table of the structure. This modification reduces the failure rate of cuckoo hashing to an inverse-polynomial function with an exponent that can be made arbitrarily large by increasing the stash size. However, larger stashes also mean slower searches for keys that are not present or are in the stash. A stash can be used in combination with more than two hash functions or with blocked cuckoo hashing to achieve both high load factors and small failure rates. The analysis of cuckoo hashing with a stash extends to practical hash functions, not just to the random hash function model commonly used in theoretical analysis of hashing.
Some people recommend a simplified generalization of cuckoo hashing called skewed-associative cache in some . "Micro-Architecture".
Another variation of a cuckoo hash table, called a cuckoo filter, replaces the stored keys of a cuckoo hash table with much shorter fingerprints, computed by applying another hash function to the keys. In order to allow these fingerprints to be moved around within the cuckoo filter, without knowing the keys that they came from, the two locations of each fingerprint may be computed from each other by a bitwise exclusive or operation with the fingerprint, or with a hash of the fingerprint. This data structure forms an approximate set membership data structure with much the same properties as a Bloom filter: it can store the members of a set of keys, and test whether a query key is a member, with some chance of (queries that are incorrectly reported as being part of the set) but no . However, it improves on a Bloom filter in multiple respects: its memory usage is smaller by a constant factor, it has better locality of reference, and (unlike Bloom filters) it allows for fast deletion of set elements with no additional storage penalty.
A survey by Mitzenmacher presents open problems related to cuckoo hashing as of 2009.
|
|